In preparation for the technical interview, the candidate is asked to complete a Data Science exercise within 48 hours. The idea is to simulate the final presentation that he or she would give to a hypothetical internal stakeholder during the interview. The result of the exercise is therefore used as the basis for the presentation and for commenting on the choices from a technical point of view. For simplicity, it is strongly recommended to perform the analysis in a Jupyter Notebook and to submit the result as an HTML file. The instructions for the exercise follow; the dataset must be loaded from the file "dataset_reale.csv".
Introduction: the Data Science Center of Excellence has been asked by the business side to improve the activity of the Welfare office. In particular, the solution needs to explore the possibility of supporting smoking customers against the onset of cardio-respiratory diseases through a digital app. Drawing on its historical database, the Welfare office produced a dataset to be provided for analysis.
Definition of the starting dataset:
First goal: an exploratory analysis of the dataset, proposed by the candidate, that enables the Welfare office to better understand the data it provided. Second goal: an algorithm that predicts the probability of contracting a cardio-respiratory disease on the basis of the observed data. Qualitative parameters should also be provided in order to validate the chosen method.
Info: explanations of the choices made will be valued very positively.
# Load the data
import pandas as pd
data = pd.read_csv("reale_tech_ds/dataset_reale.csv")
# Head sample of the data
data.head(5)
| | columns_0 | columns_1 | columns_2 | columns_3 | columns_4 | columns_5 | columns_6 | columns_7 | columns_8 | columns_9 | columns_10 | columns_11 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.040916 | 0.339411 | 0.579828 | 25.173001 | 0.065054 | 55.154329 | 205.060967 | 1.413931 | 4.695189 | 0.551543 | 12.303658 | 7.071068 | 0 |
| 1 | 11.737973 | 0.296985 | 0.692965 | 28.001429 | 0.076368 | 70.710678 | 326.683333 | 1.415911 | 4.228499 | 0.763675 | 13.010765 | 7.071068 | 0 |
| 2 | 9.758074 | 0.311127 | 0.551543 | 8.485281 | 0.049497 | 62.225397 | 199.404112 | 1.401811 | 4.398204 | 0.466690 | 17.677670 | 8.485281 | 0 |
| 3 | 12.586501 | 0.466690 | 0.452548 | 2.121320 | 0.066468 | 15.556349 | 282.842712 | 1.407708 | 4.511341 | 0.650538 | 13.293607 | 7.071068 | 0 |
| 4 | 7.212489 | 0.431335 | 0.183848 | 2.474874 | 0.050912 | 24.041631 | 103.237590 | 1.400071 | 4.808326 | 0.721249 | 17.441967 | 7.071068 | 0 |
# Shape
data.shape
(5858, 13)
# Data types
data.dtypes
columns_0     float64
columns_1     float64
columns_2     float64
columns_3     float64
columns_4     float64
columns_5     float64
columns_6     float64
columns_7     float64
columns_8     float64
columns_9     float64
columns_10    float64
columns_11    float64
label           int64
dtype: object
# Simple describers from pandas
describer = data.describe()
describer
| | columns_0 | columns_1 | columns_2 | columns_3 | columns_4 | columns_5 | columns_6 | columns_7 | columns_8 | columns_9 | columns_10 | columns_11 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5858.000000 | 5856.000000 | 5856.000000 | 5856.000000 | 5858.000000 | 5854.000000 | 5858.000000 | 5857.000000 | 5856.000000 | 5855.000000 | 5854.000000 | 5858.000000 | 5858.000000 |
| mean | 10.084987 | 0.456509 | 0.458569 | 8.053181 | 0.075232 | 44.896570 | 172.315694 | 1.406476 | 4.538341 | 0.735502 | 14.835989 | 8.249418 | 0.180608 |
| std | 1.709037 | 0.215407 | 0.196266 | 6.882227 | 0.045226 | 25.002479 | 76.319009 | 0.004269 | 0.223442 | 0.199040 | 1.694602 | 1.242915 | 0.384726 |
| min | 5.374012 | 0.113137 | 0.000000 | 0.848528 | 0.012728 | 1.414214 | 8.485281 | 1.395984 | 3.846661 | 0.311127 | 11.313708 | 4.242641 | 0.000000 |
| 25% | 9.050967 | 0.311127 | 0.353553 | 2.545584 | 0.052326 | 25.455844 | 126.218560 | 1.403070 | 4.384062 | 0.593970 | 13.435029 | 7.071068 | 0.000000 |
| 50% | 9.758074 | 0.395980 | 0.438406 | 5.091169 | 0.065054 | 42.426407 | 173.948268 | 1.406506 | 4.525483 | 0.707107 | 14.566400 | 8.485281 | 0.000000 |
| 75% | 10.748023 | 0.530330 | 0.551543 | 12.020815 | 0.082024 | 60.811183 | 224.859956 | 1.409745 | 4.681047 | 0.834386 | 15.980613 | 8.485281 | 0.000000 |
| max | 22.485996 | 2.234457 | 2.347595 | 93.055252 | 0.864084 | 408.707720 | 622.253967 | 1.469340 | 5.670996 | 2.800143 | 21.071782 | 12.727922 | 1.000000 |
# Easy data profile and basic statistics for each column
from pandas_profiling import ProfileReport
profile = ProfileReport(data, title = "Report")
profile
# Columns with missing values
counter_describer = describer.loc["count"].to_frame()
counter_describer["number_of_nan"] = data.shape[0] - counter_describer["count"]
counter_describer["percentage_of_nan"] = counter_describer["number_of_nan"] / data.shape[0]
counter_describer.sort_values(by="number_of_nan", ascending=False)
| | count | number_of_nan | percentage_of_nan |
|---|---|---|---|
| columns_5 | 5854.0 | 4.0 | 0.000683 |
| columns_10 | 5854.0 | 4.0 | 0.000683 |
| columns_9 | 5855.0 | 3.0 | 0.000512 |
| columns_1 | 5856.0 | 2.0 | 0.000341 |
| columns_2 | 5856.0 | 2.0 | 0.000341 |
| columns_3 | 5856.0 | 2.0 | 0.000341 |
| columns_8 | 5856.0 | 2.0 | 0.000341 |
| columns_7 | 5857.0 | 1.0 | 0.000171 |
| columns_0 | 5858.0 | 0.0 | 0.000000 |
| columns_4 | 5858.0 | 0.0 | 0.000000 |
| columns_6 | 5858.0 | 0.0 | 0.000000 |
| columns_11 | 5858.0 | 0.0 | 0.000000 |
| label | 5858.0 | 0.0 | 0.000000 |
# Total missing values
counter_describer["number_of_nan"].sum()
20.0
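The same per-column NaN count can also be obtained directly with `isna()`, without going through `describe()`. A minimal cross-check on a small synthetic frame (the data and column names below are illustrative, not rows from the real dataset):

```python
import numpy as np
import pandas as pd

# Small synthetic frame with a few missing values (illustrative data)
df = pd.DataFrame({
    "columns_0": [1.0, 2.0, np.nan, 4.0],
    "columns_1": [np.nan, 0.5, 0.7, np.nan],
    "label": [0, 1, 0, 1],
})

# Per-column NaN counts and the overall total
nan_per_column = df.isna().sum()
total_nan = int(nan_per_column.sum())
print(nan_per_column)
print(total_nan)  # → 3
```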
# Target
import plotly.express as px
target_analysis = data.groupby(["label"]) \
.size().to_frame() \
.reset_index() \
.rename(columns={0:"count"})
fig = px.bar(target_analysis, x='label', y='count')
fig.show()
target_analysis["percentage"] = target_analysis["count"] / target_analysis["count"].sum()
target_analysis
| | label | count | percentage |
|---|---|---|---|
| 0 | 0 | 4800 | 0.819392 |
| 1 | 1 | 1058 | 0.180608 |
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.drop("label", axis=1), data["label"],
test_size=0.30,
random_state=42)
# Test set balance
y_test.value_counts()
0    1480
1     278
Name: label, dtype: int64
# Training set balance
y_train.value_counts()
0    3320
1     780
Name: label, dtype: int64
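With roughly 18% positives, the class ratio happens to be preserved by the random split above; `train_test_split` can guarantee this with the `stratify` argument. A minimal sketch on synthetic labels (the data below is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 negatives, 20 positives (illustrative)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=42, stratify=y)
print(y_tr.mean(), y_te.mean())  # both close to 0.20
```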
There are very few NaN values, so I decided to drop all rows containing NaN values from the training set.
training_data = X_train.copy()
training_data["label"] = y_train
training_data = training_data.dropna()
training_data.shape, X_train.shape
((4087, 13), (4100, 12))
X_train = training_data.drop("label", axis=1)
y_train = training_data["label"]
# SMOTE Oversampling
from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
y_resampled.value_counts()
1    3310
0    3310
Name: label, dtype: int64
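SMOTE balances the classes by synthesising new minority samples: each synthetic point is an interpolation `x_new = x + u * (x_neighbour - x)`, with `u ~ U(0, 1)`, between a minority point and one of its nearest minority-class neighbours. A minimal numpy sketch of that interpolation step only (not the full algorithm; the points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two minority-class points (illustrative)
x = np.array([1.0, 2.0])
x_neighbour = np.array([3.0, 6.0])

# SMOTE-style interpolation: a random point on the segment between them
u = rng.uniform(0.0, 1.0)
x_new = x + u * (x_neighbour - x)

# The synthetic sample lies on the segment joining the two originals
assert np.all(x_new >= np.minimum(x, x_neighbour))
assert np.all(x_new <= np.maximum(x, x_neighbour))
print(x_new)
```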
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
# Pipeline for preprocessing
numeric_features = X_resampled.columns
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
#('scaler', StandardScaler())
])
# Preprocessor definition
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
])
# Models to be tested in grid search
clf1 = RandomForestClassifier(random_state=42)
clf2 = GradientBoostingClassifier(random_state=42)
# Pipeline definition
pipe = Pipeline(steps=[('preprocessor', preprocessor),
('feature_selection', SelectFromModel(RandomForestClassifier(random_state=10),
threshold="median")),
('classifier', clf1)
])
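With `threshold="median"`, the `SelectFromModel` step keeps only the features whose impurity-based importance is at least the median importance, i.e. roughly half of them. A minimal sketch on a synthetic dataset (the data is illustrative; only the first two features carry signal):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(1)
# 6 features, only the first two informative (synthetic, illustrative)
X = rng.normal(size=(400, 6))
y = ((X[:, 0] + X[:, 1]) > 0).astype(int)

selector = SelectFromModel(RandomForestClassifier(random_state=10),
                           threshold="median").fit(X, y)
print(selector.get_support())  # the informative features survive
```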
# Parameters grid for each classifier
param1= {"preprocessor__num__imputer__strategy": ['mean'],
"classifier__n_estimators":[50],
"classifier__min_samples_leaf":[1],
"classifier__max_depth":[None,10],
"classifier__min_samples_split":[2],
"classifier":[clf1]}
param2= {"preprocessor__num__imputer__strategy": ['mean'],
"classifier__learning_rate":[0.1,0.05],
"classifier__min_samples_leaf":[1],
"classifier__max_depth":[None,5],
"classifier__min_samples_split":[2],
"classifier__n_estimators":[10],
"classifier":[clf2]}
# Define all parameters for the grid search
params = [param1, param2]
# Define the grid search
search = GridSearchCV(pipe,
params,
cv=5,
n_jobs=-1)
# Train the model
search.fit(X_resampled,y_resampled)
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
Index(['columns_0', 'columns_1', 'columns_2', 'columns_3', 'columns_4',
'columns_5', 'columns_6', 'columns_7', 'columns_8', 'columns_9',
'columns_10', 'columns_11'],
dtype='object'))])),
('feature_sele...
'classifier__n_estimators': [50],
'preprocessor__num__imputer__strategy': ['mean']},
{'classifier': [GradientBoostingClassifier(random_state=42)],
'classifier__learning_rate': [0.1, 0.05],
'classifier__max_depth': [None, 5],
'classifier__min_samples_leaf': [1],
'classifier__min_samples_split': [2],
'classifier__n_estimators': [10],
'preprocessor__num__imputer__strategy': ['mean']}])
# Get the best model
best_model = search.best_estimator_
best_model
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer())]),
Index(['columns_0', 'columns_1', 'columns_2', 'columns_3', 'columns_4',
'columns_5', 'columns_6', 'columns_7', 'columns_8', 'columns_9',
'columns_10', 'columns_11'],
dtype='object'))])),
('feature_selection',
SelectFromModel(estimator=RandomForestClassifier(random_state=10),
threshold='median')),
('classifier',
RandomForestClassifier(n_estimators=50, random_state=42))])
from sklearn.linear_model import LogisticRegression
# Calibrate the best model's probabilities with a logistic regression
# fitted on its training-set scores (Platt scaling)
logistic_clf = LogisticRegression(random_state=42)
train_prob = best_model.predict_proba(X_train)[:,1].reshape(-1,1)
logistic_clf.fit(train_prob, y_train)
LogisticRegression(random_state=42)
# Hard predictions, raw scores and calibrated probabilities on the test set
y_pred = best_model.predict(X_test)
y_pred_score = best_model.predict_proba(X_test)
y_pred_prob = logistic_clf.predict_proba(best_model.predict_proba(X_test)[:,1].reshape(-1,1))
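The two-step scheme above is essentially Platt scaling: a sigmoid `p = 1 / (1 + exp(-(a*s + b)))` maps the forest's raw scores `s` to calibrated probabilities, with `a` and `b` learned by the logistic regression. A minimal numpy sketch of the mapping (the coefficients below are illustrative, not the fitted ones); scikit-learn also packages this scheme as `CalibratedClassifierCV(method="sigmoid")`:

```python
import numpy as np

def platt_sigmoid(scores, a=4.0, b=-2.0):
    """Map raw scores in [0, 1] to calibrated probabilities.

    a and b are illustrative; in practice they are learned by a
    logistic regression on the model's scores, as in the cell above.
    """
    return 1.0 / (1.0 + np.exp(-(a * np.asarray(scores) + b)))

scores = np.array([0.1, 0.5, 0.9])
probs = platt_sigmoid(scores)
print(probs)  # monotonically increasing, all in (0, 1)
```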
# Confusion matrix
%matplotlib inline
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, plot_roc_curve, roc_auc_score, PrecisionRecallDisplay, auc
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred, labels=best_model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=best_model.classes_)
disp.plot()
plt.show()
from sklearn.metrics import precision_score, recall_score
print(f"Precision: {round(precision_score(y_test, y_pred),2)}; Recall: {round(recall_score(y_test, y_pred),2)}.")
Precision: 0.97; Recall: 0.98.
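As a sanity check, both metrics follow directly from the confusion-matrix counts: precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal sketch with illustrative counts (not the exact counts from this test set):

```python
# Illustrative confusion-matrix counts (not the ones from this test set)
tp, fp, fn = 272, 8, 6

precision = tp / (tp + fp)  # of all predicted positives, how many are true
recall = tp / (tp + fn)     # of all actual positives, how many were found

print(round(precision, 2), round(recall, 2))  # → 0.97 0.98
```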
# Plot the ROC curve (RocCurveDisplay replaces the deprecated plot_roc_curve)
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(best_model, X_test, y_test)
# Calculate AUC under ROC curve
roc_auc = roc_auc_score(y_test, y_pred_prob[:,1])
roc_auc
0.9992271048026444
display = PrecisionRecallDisplay.from_estimator(
best_model, X_test, y_test)
# AUC of the ROC curve computed from the hard predictions (y_pred)
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
metrics.auc(fpr, tpr)
0.9883044915419017
# Calculating the F1-score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)
f1
0.9767441860465116
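The F1-score is the harmonic mean of precision and recall, `F1 = 2 * P * R / (P + R)`, so it only approaches 1 when both are high. A minimal sketch (the input values are illustrative, rounded from the scores above):

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.97, 0.98), 4))  # → 0.975
```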
import sklearn
import warnings
import numpy as np  # used by get_feature_names below
def get_feature_names(column_transformer):
def get_names(trans):
if trans == 'drop' or (
hasattr(column, '__len__') and not len(column)):
return []
if trans == 'passthrough':
if hasattr(column_transformer, '_df_columns'):
if ((not isinstance(column, slice))
and all(isinstance(col, str) for col in column)):
return column
else:
return column_transformer._df_columns[column]
else:
indices = np.arange(column_transformer._n_features)
return ['x%d' % i for i in indices[column]]
if not hasattr(trans, 'get_feature_names'):
if column is None:
return []
else:
return [name + "__" + f for f in column]
return [name + "__" + f for f in trans.get_feature_names()]
feature_names = []
if type(column_transformer) == sklearn.pipeline.Pipeline:
l_transformers = [(name, trans, None, None) for step, name, trans in column_transformer._iter()]
else:
l_transformers = list(column_transformer._iter(fitted=True))
for name, trans, column, _ in l_transformers:
if type(trans) == sklearn.pipeline.Pipeline:
_names = get_feature_names(trans)
if len(_names)==0:
_names = [name + "__" + f for f in column]
feature_names.extend(_names)
else:
feature_names.extend(get_names(trans))
return feature_names
def get_features_importance(model):
features_input = list(model.named_steps["feature_selection"] \
.get_feature_names_out(input_features=get_feature_names(model \
.named_steps["preprocessor"])))
feature_importance = model.named_steps["classifier"] \
.feature_importances_
return features_input, pd.DataFrame([features_input,feature_importance]).T \
.rename(columns={0:"feature_name", 1:"features_importance"}) \
.sort_values(by="features_importance", ascending=False)
features, feature_importance = get_features_importance(best_model)
feature_importance
| | feature_name | features_importance |
|---|---|---|
| 3 | num__columns_6 | 0.339246 |
| 2 | num__columns_4 | 0.325359 |
| 0 | num__columns_1 | 0.175046 |
| 5 | num__columns_9 | 0.07457 |
| 4 | num__columns_7 | 0.056843 |
| 1 | num__columns_3 | 0.028936 |
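Impurity-based importances like those above can favour high-variance features; permutation importance is a model-agnostic cross-check. A minimal sketch on a synthetic dataset (the data is illustrative; only the first feature carries signal):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Synthetic data: only the first feature carries signal (illustrative)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(random_state=42).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=5, random_state=42)
print(result.importances_mean)  # the first feature dominates
```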
import shap
explainer = shap.TreeExplainer(best_model["classifier"])
# Use transform (not fit_transform): the preprocessor is already fitted on the training data
df_for_shap = pd.DataFrame(best_model["feature_selection"].transform(best_model["preprocessor"].transform(X_test)),
                           columns=features)
shap_values = explainer.shap_values(df_for_shap)
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [50, 100]
shap.summary_plot(shap_values[1], df_for_shap.values, feature_names = df_for_shap.columns,plot_size=(10,10))